Analyzing The Data

Before I start any analysis, I like to brain dump a few questions I hope to answer based on the data I’m working with. This gives me a sense of direction for which computations to run, and it helps me keep track of my work if I ever return to old analysis projects. In this lesson, we’re going to answer the following:

  1. Which Texas lawmakers sponsored the most legislation? Who sponsored the least legislation?

    • We’ll also break this down by political party and compare Democrats vs. Republicans.
  2. Looking at the top 10 sponsors of this session: Which parts of the state do they represent?

  3. What does the distribution of bill status look like? (Introduced vs. Engrossed vs. Passed)

    • How many haven’t made any progress at all?
  4. How can we identify bills based on keywords, like “immigration” to name one example?

  5. How many bills have been assigned to each committee? How many committees make up the Texas Senate vs. the House?

This is not an exhaustive list of the questions you could answer with this data, but it’s a good introduction to sorting through large data sets. Let’s take a look at these questions one at a time.

Note

My results might look different than yours throughout this section because I downloaded the data on a different date, meaning yours is more up-to-date than mine. If you ever re-run this analysis at a later date, expect your results to change.

Setup again

Open your analysis.qmd file.

As with your cleaning file, you need to load the libraries that allow you to run certain functions. We’ll be using the same ones. Type out the code below and run it in a new block at the top of the file.

library(tidyverse)
library(janitor)
Important

Don’t forget to label your work and save regularly!

Import the clean data

Remember that rds file you saved in the previous section? Now we want to tell R to read that file so it appears here. To do so, you’ll use a very similar function to read_csv() … which is called read_rds(). As with read_csv, you’ll write a direct path to where the rds file sits in your data-processed folder.

Run read_rds() for the processed data, and save it in an object called “final_table.” After you’ve run that command, “call” the object final_table (meaning run it) so the data appears.

If you need a reminder, the code looks like this:

final_table <- read_rds("data-processed/bills-89.rds")
final_table

If you click through, it should be the exact table you saved at the end of the cleaning process. Right now, you can see the data appears in alphabetical order by lawmaker last name, which is why Rep. Alma Allen comes up for the first 50 rows.

In the next few steps, we’ll use different functions to rearrange the table and look at the data in new ways!

1. Sponsors: most vs. least

Now let’s get into the first question: Who sponsored the most bills this session? To figure this out, youll use three functions — group_by(), summarize() and arrange().

  • group_by() allows us to sort the data into specific groups based on variables that we want to focus on; in this case, we want to group by name, party, role, and district.

  • summarize() computes a summary of the data usually by a mean or a sum; we’ll just be focusing on the sum aspect because we want to count the number of bills each lawmaker has filed.

  • arrange() allows us to determine how we want to data to be presented; we can use this function to order the lawmaker names from most filed to least filed

It’s very common to use these three functions together to run simple computations, so get used to this order of operations.

1.1 Doing the most first

Run the following code to see who sponsored the most bills this session.

final_table |> 
  group_by(name, party, role) |> 
  summarise(number_filed = n()) |> 
  arrange(desc(number_filed))
Tip

If you run the code and you get an error, check the syntax. Do all of your indents match mine? Did you accidentally add in a comma or a parenthesis? Code syntax is extremely important, and most code errors are from very small mistakes in the typing.

Looks like Sen. Bryan Hughes sponsored the most this session. If you click through the table, you’ll see the values in the “number_filed” column decrease. That’s because we added the “desc()” element to our arrange function, which means we want it arranged in descending order.

Note

My exact numbers might look different from yours because I downloaded this data on a different day, so don’t worry if your result isn’t exactly the same as mine.

1.2 Just the top of the list

We’re looking at a lot of rows here, 180 to be exact, so what if we wanted to highlight the top 10 lawmakers who have sponsored the most this session? We can do this with the head() function. Run the code again, this time with one extra line at the end. Save it into an object called most_spons.

most_spons <- final_table |> 
  group_by(name, party, role) |> 
  summarise(number_filed = n()) |> 
  arrange(desc(number_filed)) |> 
  head(10)

Now call most_spons to see the top 10 list.

most_spons

Because we put the number 10 inside the head function, the table now only shows the first 10 rows – or top 10 lawmakers based on number of bills filed. Feel free to mess around with the number inside the head function and see how your result changes.

1.3 Who sponsored the least?

We can easily see who filed the least bills with one small change to the code above. The arrange() function defaults to ascending order when we don’t have that desc() specification in the code. So all we need to do is take that out and run it! Do this in a new code block, not the one you used earlier.

final_table |> 
  group_by(name, party, role) |> 
  summarise(number_filed = n()) |> 
  arrange(number_filed)

Rep. Ramon Romero Jr. sponsored the least legislation this session, with only 10 bills. Like before, let’s just focus on the top 10 rows, which in this case would be the 10 lawmakers who filed the least bills this session. I think you know what to add here!

least_spons <- final_table |> 
  group_by(name, party, role) |> 
  summarise(number_filed = n()) |> 
  arrange(number_filed) |> 
  head(10)
least_spons

1.4 Democrats vs. Republicans

To break this down by political party, we’re going to use the filter() function, which tells R to filter for a certain condition in the data. In this case, the condition is whether party equals D or R.

First we’re going to tell it to filter for Democratic lawmakers by setting party == to “D” (note the double equal sign, this is the proper syntax). Then, we’re telling it to count the number of times party equals D. That tells us the number of bills/resolutions sponsored by a Democratic lawmaker.

Copy and run the code.

final_table |> 
  filter(party == "D") |> 
  count(party)

Repeat this in a new chunk, but filter for Republicans. Try writing it on your own first before viewing my code.

Code
final_table |> 
  filter(party == "R") |> 
  count(party)

You can see 8,969 bills/resolutions were sponsored by Republicans, while only 5,221 were sponsored by Democrats. This makes sense, considering Republicans have the majority in Congress.

Congrats! You just answered the first questions from the brainstorm. And hopefully you’re starting to realize how you can organize a data table and look at it in different ways.

2. Districts

Now that we know who filed the most bills this session, we want to figure out which districts they represent. Our data set has a Texas house or senate district (HD/SD) associated with each lawmaker, so we can use a similar method as above to see where each one comes from. Key things here:

  • We’re adding the district column to the group_by list

  • We’re using the count() function to tell R to count the number of entries for each lawmaker (aka how many bills they sponsored)

  • We’re telling it to arrange the data in descending order

final_table |> 
  group_by(name, district, role, party) |> 
  count(name) |> 
  arrange(desc(n))

Now you can see the congressional district associated with each lawmaker. If you want to know what this looks like on a map (because most people don’t know congressional districts off the top of their head), here’s are maps for the Texas House and Texas Senate.

2.1 Just senators

To just view Texas senators and their districts, you need to filter the role column. The code is the same as above with the new command added after the group_by line.

sen_districts <- final_table |> 
  group_by(name, district, role, party) |> 
  filter(role == "Sen") |> 
  count(name) |> 
  arrange(desc(n))
sen_districts

2.2 Just representatives

You know the drill. Filter for “Rep” instead of “Sen” and try it on your own first.

Code
house_districts <- final_table |> 
  group_by(name, district, role, party) |> 
  filter(role == "Rep") |> 
  count(name) |> 
  arrange(desc(n))
house_districts

Nice! We’re actually going to make our own senate district map later on, so this is an important step.

3. Bill status

LegiScan tracks the status of bills in four ways:

  • No progress: a bill that was filed but not actually introduced to the session (These appear as “NA” in the data)

  • Introduced: approved for review in session and pending in one of the chambers

  • Engrossed: passed one chamber and sent to the other for review

  • Passed: approved by both chambers, headed to the governor’s desk

Let’s see how many bills/resolutions are in each category. This is as of the date the data was last updated by LegiScan, so again my result might look different than yours.

final_table |> 
  group_by(status_desc) |> 
  count(status_desc) |> 
  arrange(desc(n))

As you can see, most bills are at the status Introduced, meaning they’re still being reviewed in the first chamber and haven’t gone to an official vote. Or maybe there was a vote without enough support and the bill died on the floor. If you’re running this analysis much later in the session, or after the fact, you’ll probably see a higher number for engrossed and passed.

You might be thinking, wow a lot of bills made it to the governor. Those are almost exclusively resolutions that memorialize people, commemorate anniversaries or recognize certain days in honor of something. They usually only have to be approved by one chamber, which is why they move through the legislative process so quickly.

4. Keyword detection

Maybe you have a specific bill or keyword in mind that you want to quickly search for. You technically don’t need code to do this. You should be able to see the final_table object listed in the environment window (top right). Click on it to open it in the source window. From there, you can use the search bar to look for specific bill numbers, lawmaker names, keywords in the description, etc.

If you want to create a new table of bills that include certain keywords, though, you’ll do so in a code chunk using a more extensive filtering method called filter(grepl()). Without getting too technical, you’re going to tell R what word to look for and in which column.

Let’s see how many bills have the word “immigration” in the description. Create a code block and run the following:

final_table |> 
  filter(grepl("immigration", description, ignore.case = TRUE))

Now you can see every bill with a description that includes the word “immigration.” If you want to search for more than one word at a time, add “|” in between each word. For example:

final_table |> 
  filter(grepl("immigration|border|visa|refugee", description, ignore.case = TRUE))

4.1 On your own

Experiment with this function by filtering for other keywords. To change the column name you’re filtering, replace “description” with the name of the column.

Note

The LegiScan data comes from the bill description only, not the full bill text. This is important to keep in mind as you search for key words.

5. Committees

There are 54 different committees between the House and the Senate in the 89th session, each responsible for reviewing different types of policy. To see which committee was assigned the most bills, run this code in a new block:

bills_per_committee <- final_table |> 
  group_by(committee) |> 
  summarise(bills = n()) |> 
  arrange(desc(bills))
bills_per_committee

Looks like about half of the bills haven’t been assigned to committee, which makes sense given how many there are in session. In my results, the House State Affairs Committee has the most, but not by much. Other committees that have been assigned a lot of bills include House Public Education and Senate State Affairs. This would be an interesting thing to re-run later in the session, or after it ends, to see how the data changed.

5.1 Number of Senate committees

Filter for every committee name that includes “Senate.”

bills_per_committee |> 
  filter(grepl("Senate", committee, ignore.case = TRUE)) |> 
  count()

5.2 Number of House committees

Now repeat that in a new block, but filter for House.

Code
bills_per_committee |> 
  filter(grepl("House", committee, ignore.case = TRUE)) |> 
  count()

This result makes sense, given there are 150 representatives and only 31 senators.

Saving data for charts

You’re going to use some of the data from this section to build our two charts, which means you need to save those tables to your computer. The write_csv() function takes a specific object and saves it as a csv file to your computer.

Do this with the most_spons object, which is the table of top 10 sponsors that you saved earlier. The format for write_csv is ([object name], “[path to where csv should save]”). Copy and run the following code to save most_spons to your data-processed folder.

write_csv(most_spons, "data-processed/most-spons.csv")

Now do the same for the sen_districts object.

write_csv(sen_districts, "data-processed/sen-districts.csv")

Click inside your data-processed folder to make sure those appear. AND SAVE YOUR ANALYSIS FILE AGAIN.

Export CSVs from posit.cloud

  1. Open the data-processed folder.
  2. Check the boxes for the two csv files you just created.
  3. Click the blue settings button, and hit Export.
  4. Name the zip file something that’s easy to find later, like “processed-lege-tables” and hit Download.

Make sure to do this before moving on to the next section.

Recap

Pat yourself on the back because you just did so much great work! On top of answering all of the questions from the brainstorm, you learned a bunch of functions that are helpful for any data analysis:

  • group_by()

  • summarise()/summarize() … both appear in R and they do the same thing

  • arrange() and arrange(desc())

  • count()

  • filter() and filter(grepl()

You also learned how to save objects and run them again, which is a good way to store the changes you made to the data. Lastly, you learned how to turn objects into csv files using write_csv.

And hopefully you learned something new about the status of the Texas session! In the next section, you’ll create two beautiful charts with no coding needed.